Amazon Apparel Recommendations

Overview of the data

Of these 19 features, we will use only the following 7:

1. asin (Amazon Standard Identification Number)
2. brand (brand to which the product belongs)
3. color (color information of the apparel; a value can contain several colors, e.g. "red and black stripes")
4. product_type_name (type of the apparel, e.g. SHIRT/TSHIRT)
5. medium_image_url (URL of the image)
6. title (title of the product)
7. formatted_price (price of the product)

We do this because 'author', 'publisher', 'availability', 'large_image_url', 'availability_type', 'small_image_url', 'editorial_review', 'model', 'manufacturer', and 'editorial_reivew' are largely irrelevant for recommendation. We arrived at this by asking our friends which features they look at when they scroll through products.
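The column selection described above can be sketched with pandas. This is only an illustration: the tiny `raw` frame, its values, and the example.com URLs are made up, and the real data has 19 columns rather than the 9 shown here.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values).
raw = pd.DataFrame({
    "asin": ["B001", "B002"],
    "brand": ["BrandA", None],
    "color": ["Black", "Red"],
    "product_type_name": ["SHIRT", "SHIRT"],
    "medium_image_url": ["http://example.com/1.jpg", "http://example.com/2.jpg"],
    "title": ["shirt one", "shirt two"],
    "formatted_price": ["$10.00", None],
    "author": [None, None],
    "publisher": [None, None],
})

# The 7 features kept for the recommendation work.
KEEP = ["asin", "brand", "color", "product_type_name",
        "medium_image_url", "title", "formatted_price"]

data = raw[KEEP]

# Non-null count per retained column, a quick view of missing data.
non_null_counts = data.notnull().sum()
print(non_null_counts)
```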

Missing data for various features.

Basic stats for the feature: product_type_name

We have a total of 72 unique product_type_names.
91.62% (167794/183138) of the products are shirts

As we can see above, 'SHIRT' appears very often in the data we procured (about 167k times). This is expected, since shirt data is what we queried.
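Per-feature statistics of this kind can be computed along the following lines. The `types` series is a made-up toy sample; the last line only re-checks the percentage quoted above from the two counts given in the text.

```python
import pandas as pd

# Toy sample of a categorical feature (hypothetical values).
types = pd.Series(["SHIRT", "SHIRT", "SHIRT", "BLAZER", None],
                  name="product_type_name")

n_unique = types.nunique()              # distinct non-null values
n_missing = int(types.isnull().sum())   # rows without a value

counts = types.value_counts()           # sorted by frequency, NaN excluded
top_value = counts.index[0]
top_share = 100 * int(counts.iloc[0]) / len(types)

# Re-deriving the share of shirts from the counts quoted in the text.
shirt_share = round(100 * 167794 / 183138, 2)
```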

Basic stats for the feature: brand

There are 10577 unique brands
183138 - 182987 = 151 missing values.

Basic stats for the feature: color

We have 7380 unique colors.
7.2% of products are black in color.
64956 of 183138 products have color information. That's approximately 35.5%.

Basic stats for the feature: formatted_price

Only 28,395 products (15.5% of the whole data) have price information.

Basic stats for the feature: title

We brought down the number of data points from 183K to 28K.
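A reduction like this can be obtained by keeping only rows where a feature is present. The sketch below assumes the filter is on `formatted_price`, which matches the 28,395 priced products reported earlier but is not stated explicitly; the toy frame is hypothetical.

```python
import pandas as pd

# Toy data with some missing prices (hypothetical values).
data = pd.DataFrame({
    "asin": ["B001", "B002", "B003"],
    "formatted_price": ["$10.00", None, "$25.50"],
    "color": ["Black", "Red", None],
})

# Keep only products that carry price information (assumed filter).
priced = data.loc[data["formatted_price"].notnull()]
print(len(data), "->", len(priced))
```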

Removing near duplicate items

Understanding the duplicates.

We have 2325 products that have the same title but a different color.
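One way to count and inspect products that share a title is pandas' `duplicated`. Whether the figure above was computed exactly this way is an assumption; the toy frame is hypothetical.

```python
import pandas as pd

# Toy data: three products share one title but differ in color.
data = pd.DataFrame({
    "asin": ["B001", "B002", "B003", "B004"],
    "title": ["plain cotton shirt", "plain cotton shirt",
              "striped polo", "plain cotton shirt"],
    "color": ["Black", "White", "Blue", "Red"],
})

# True for every row whose title already appeared in an earlier row.
repeat_mask = data.duplicated(subset="title")
n_repeats = int(repeat_mask.sum())

# All rows that belong to a group of shared titles, for inspection.
shared = data[data.duplicated(subset="title", keep=False)].sort_values("title")
```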

These shirts are exactly the same except for size (S, M, L, XL):

:B00AQ4GMCK :B00AQ4GMTS
:B00AQ4GMLQ :B00AQ4GN3I

These shirts are exactly the same except for color:

:B00G278GZ6 :B00G278W6O
:B00G278Z2A :B00G2786X8

Our data contains many duplicate products like the examples above; we need to de-duplicate them for better results.

Removing duplicates

After looking at the data thoroughly, here are some examples of duplicate titles that differ only in the last few words.

Titles 1:
16. woman's place is in the house and the senate shirts for Womens XXL White
17. woman's place is in the house and the senate shirts for Womens M Grey

Titles 2:
25. tokidoki The Queen of Diamonds Women's Shirt X-Large
26. tokidoki The Queen of Diamonds Women's Shirt Small
27. tokidoki The Queen of Diamonds Women's Shirt Large

Titles 3:
61. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
62. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
63. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt
64. psychedelic colorful Howling Galaxy Wolf T-shirt/Colorful Rainbow Animal Print Head Shirt for woman Neon Wolf t-shirt

We removed the duplicates that differ only at the end.


In the previous cell, we sorted the whole data in alphabetical order of titles. Then we removed titles that are adjacent and very similar.
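The sort-then-compare-neighbours idea can be sketched as below. The word-by-word comparison and the `max_diff=2` threshold are assumptions made for illustration, not necessarily the rule used in the cell.

```python
from itertools import zip_longest

def differing_words(a, b):
    """Count word positions at which two titles differ (missing words count)."""
    return sum(1 for x, y in zip_longest(a.split(), b.split()) if x != y)

def drop_adjacent_near_duplicates(titles, max_diff=2):
    """Sort titles, then keep a title only if it differs from the last kept
    title in more than `max_diff` word positions (threshold is an assumption)."""
    kept = []
    for title in sorted(titles):
        if kept and differing_words(kept[-1], title) <= max_diff:
            continue  # near duplicate of the previously kept title
        kept.append(title)
    return kept

titles = [
    "tokidoki The Queen of Diamonds Women's Shirt X-Large",
    "woman's place is in the house and the senate shirts for Womens XXL White",
    "tokidoki The Queen of Diamonds Women's Shirt Small",
    "woman's place is in the house and the senate shirts for Womens M Grey",
    "tokidoki The Queen of Diamonds Women's Shirt Large",
]
kept = drop_adjacent_near_duplicates(titles)
```

Because titles are compared position by position, this only catches variants that share a long common prefix, which is why sorting first matters.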

However, some products have titles that are very similar but not adjacent after sorting.

Examples:

Titles-1
86261.  UltraClub Women's Classic Wrinkle-Free Long Sleeve Oxford Shirt, Pink, XX-Large
115042. UltraClub Ladies Classic Wrinkle-Free Long-Sleeve Oxford Light Blue XXL

Titles-2
75004.  EVALY Women's Cool University Of UTAH 3/4 Sleeve Raglan Tee
109225. EVALY Women's Unique University Of UTAH 3/4 Sleeve Raglan Tees
120832. EVALY Women's New University Of UTAH 3/4-Sleeve Raglan Tshirt
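Catching such non-adjacent near duplicates requires comparing each title against all titles kept so far. The sketch below uses Jaccard similarity on word sets as a swapped-in measure; both that choice and the 0.6 threshold are assumptions, not the method used in this analysis. The cost is quadratic in the number of titles.

```python
import re

def tokens(title):
    """Lower-cased word set; hyphens and commas act as separators."""
    return set(re.findall(r"[a-z0-9/']+", title.lower()))

def jaccard(a, b):
    """Share of words common to both titles among all words in either."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def drop_near_duplicates(titles, threshold=0.6):
    """Keep a title only if it is not too similar to any title kept so far."""
    kept = []
    for title in titles:
        if all(jaccard(title, k) < threshold for k in kept):
            kept.append(title)
    return kept

titles = [
    "EVALY Women's Cool University Of UTAH 3/4 Sleeve Raglan Tee",
    "EVALY Women's Unique University Of UTAH 3/4 Sleeve Raglan Tees",
    "EVALY Women's New University Of UTAH 3/4-Sleeve Raglan Tshirt",
    "UltraClub Women's Classic Wrinkle-Free Long Sleeve Oxford Shirt, Pink, XX-Large",
]
kept = drop_near_duplicates(titles)
```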

Text pre-processing

We tried stemming our titles, and it did not work very well.

The recommendations obtained with stemming were not up to the mark.

Text based product similarity

Bag of Words (BoW) on product titles.

12566 is the index of the product we want recommendations for. We use this index as the base index to check the results of all the recommendation engines we build.
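A bag-of-words recommender can be sketched in plain Python as follows: each title becomes a word-count vector, and the titles closest to the query are returned. Euclidean distance, the function names, and the toy titles are assumptions for illustration; `query_index` plays the role of the base index mentioned above.

```python
import math
from collections import Counter

def bow(title):
    """Word-count vector of a title."""
    return Counter(title.lower().split())

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))

def recommend(titles, query_index, k=2):
    """Indices of the k titles closest to titles[query_index]."""
    query = bow(titles[query_index])
    dists = [(euclidean(query, bow(t)), i)
             for i, t in enumerate(titles) if i != query_index]
    return [i for _, i in sorted(dists)[:k]]

titles = [
    "red striped cotton shirt",
    "red striped cotton tee",
    "blue denim jacket",
    "red cotton shirt",
]
nearest = recommend(titles, query_index=0, k=2)
```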

TF-IDF based product similarity
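As a rough illustration, TF-IDF weights each word by how often it occurs in a title and how rare it is across all titles. The sketch below uses the textbook form tf × log(N/df) with cosine similarity; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed IDF and normalize vectors by default, so their numbers will differ. Titles are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(titles):
    """Sparse TF-IDF vectors using tf = count/len and idf = log(N/df)."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency
    idf = {w: math.log(n / c) for w, c in df.items()}
    return [{w: (cnt / len(d)) * idf[w] for w, cnt in Counter(d).items()}
            for d in docs]

def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = tfidf_vectors(["red cotton shirt", "red cotton tee", "blue denim jacket"])
sim_close = cosine(vecs[0], vecs[1])   # shares "red" and "cotton"
sim_far = cosine(vecs[0], vecs[2])     # no words in common
```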

IDF based product similarity
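An IDF-only variant drops the term-frequency part: a word that is present in a title contributes its IDF weight regardless of how often it occurs there. The following is a sketch under that assumption, with idf = log(N/df) and Euclidean distance chosen for illustration.

```python
import math

def idf_vectors(titles):
    """Sparse vectors where each present word is weighted by idf = log(N/df)."""
    docs = [set(t.lower().split()) for t in titles]
    n = len(docs)
    vocab = set().union(*docs)
    idf = {w: math.log(n / sum(w in d for d in docs)) for w in vocab}
    return [{w: idf[w] for w in d} for d in docs]

def euclidean(a, b):
    """Euclidean distance between two sparse vectors."""
    return math.sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2
                         for w in set(a) | set(b)))

vecs = idf_vectors(["red cotton shirt", "red cotton tee", "blue denim jacket"])
d_close = euclidean(vecs[0], vecs[1])
d_far = euclidean(vecs[0], vecs[2])
```

A word that occurs in every title gets weight log(1) = 0 and so has no influence on the distance.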

Text Semantics based product similarity

Average Word2Vec product similarity.
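With average Word2Vec, a title is represented by the mean of its word vectors, so titles with different but related words can still end up close. The 3-dimensional `EMB` table below is entirely made up for illustration; a trained Word2Vec model would supply the real vectors.

```python
import math

# Hypothetical 3-d embeddings standing in for a trained Word2Vec model.
EMB = {
    "red": [1.0, 0.0, 0.0],
    "crimson": [0.9, 0.1, 0.0],
    "shirt": [0.0, 1.0, 0.0],
    "tee": [0.0, 0.9, 0.1],
    "jacket": [0.0, 0.0, 1.0],
}

def avg_vector(title, emb=EMB, dim=3):
    """Mean of the vectors of the words found in `emb`; zeros if none are."""
    vecs = [emb[w] for w in title.lower().split() if w in emb]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

query = avg_vector("red shirt")
sim_related = cosine(query, avg_vector("crimson tee"))
sim_unrelated = cosine(query, avg_vector("jacket"))
```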

IDF weighted Word2Vec for product similarity
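The IDF-weighted variant replaces the plain mean with a weighted mean, so rare words pull the title vector more strongly than common ones. Both the 2-dimensional embeddings and the IDF values below are hypothetical numbers chosen to make the arithmetic easy to follow.

```python
# Hypothetical embeddings and IDF weights, for illustration only.
EMB = {"red": [1.0, 0.0], "shirt": [0.0, 1.0]}
IDF = {"red": 1.0, "shirt": 3.0}

def weighted_vector(title, emb, idf, dim):
    """IDF-weighted mean of word vectors; zeros if no word is known."""
    total = [0.0] * dim
    weight_sum = 0.0
    for w in title.lower().split():
        if w in emb and w in idf:
            for i, x in enumerate(emb[w]):
                total[i] += idf[w] * x
            weight_sum += idf[w]
    return [x / weight_sum for x in total] if weight_sum else total

vec = weighted_vector("red shirt", EMB, IDF, dim=2)
```

Here "shirt" carries three times the weight of "red", so the result leans toward the "shirt" direction instead of sitting halfway between the two.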

Weighted similarity using brand and color.
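One way to combine text similarity with brand and color is a weighted average of a title distance and a mismatch count over the extra attributes. The function name, the mismatch count as the extra-feature distance, and the default weights are all assumptions made for this sketch.

```python
def combined_distance(title_dist, a, b, w_title=5.0, w_extra=5.0):
    """Weighted average of a title distance and brand/color mismatches.

    `a` and `b` are dicts with "brand" and "color" keys; the extra distance
    is the number of attributes that differ (0, 1, or 2).
    """
    extra_dist = sum(a[k] != b[k] for k in ("brand", "color"))
    return (w_title * title_dist + w_extra * extra_dist) / (w_title + w_extra)

p1 = {"brand": "BrandA", "color": "red"}
p2 = {"brand": "BrandA", "color": "blue"}

# Raising w_extra makes brand/color agreement matter more than title wording.
d_equal = combined_distance(1.0, p1, p2, w_title=1.0, w_extra=1.0)
d_same = combined_distance(0.0, p1, p1)
```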

Keras and TensorFlow to extract features

Visual features based product similarity.